We are now entering the phase of doing serious statistics: we run regression models, inspect them, plot them, and tabulate them.
For this exercise, let’s again pretend to be researchers interested in the determinants of the subjectively estimated likelihood of being treated in a hospital for COVID-19. We suspect that age plays an important role, something we have already assessed by visualizing a similar relationship. So… we need the GESIS Panel COVID-19 data again:
library(dplyr)
library(haven)
library(sjlabelled)

gp_covid <-
  read_sav(
    "../data/ZA5667_v1-1-0.sav"
  ) %>%
  set_na(na = c(-1:-99, 97)) %>%
  mutate(
    likelihood_hospital = hzcy003a,
    age_cat = as.factor(age_cat)
  ) %>%
  remove_all_labels()
We start our analysis by checking the relationship with an ANOVA. We also include some essential covariates: gender, political orientation, and marital status.
serious_anova <-
  aov(
    likelihood_hospital ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid
  )

summary(serious_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## age_cat 9 346 38.40 23.868 <2e-16 ***
## sex 1 1 0.99 0.617 0.432
## political_orientation 1 0 0.11 0.071 0.789
## marstat 1 1 1.06 0.657 0.418
## Residuals 3092 4974 1.61
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 660 observations deleted due to missingness
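Only `age_cat` shows a significant effect. To see which age groups actually differ from each other, a post-hoc test could follow the ANOVA — a sketch using base R’s `TukeyHSD()`, assuming the fitted `serious_anova` object from above (note that with an unbalanced design R will warn that the intervals are only approximate):

```r
# Pairwise comparisons of the age categories from the fitted ANOVA;
# TukeyHSD() adjusts the p-values for multiple testing
tukey_age <- TukeyHSD(serious_anova, which = "age_cat")

# Each row gives the mean difference between two age groups,
# its confidence interval, and an adjusted p-value
head(tukey_age$age_cat)
```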
Alright, there’s definitely something in the data we should investigate. Although median splits are sometimes considered bad practice, we could dichotomize the dependent variable at its median and run a logistic regression. We can create the new variable with `dplyr::mutate()` in combination with the `ifelse()` function.
gp_covid <-
  gp_covid %>%
  mutate(
    likelihood_hospital_cut =
      ifelse(
        likelihood_hospital > median(likelihood_hospital, na.rm = TRUE),
        1,
        0
      )
  )
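Before modeling, it is worth checking how the median split turned out — a quick sanity check (the exact counts depend on the data; on a discrete answer scale a median split rarely yields an exact 50/50 split):

```r
# Distribution of the dichotomized variable, including missing values
table(gp_covid$likelihood_hospital_cut, useNA = "ifany")

# The median used as the cut point, for reference
median(gp_covid$likelihood_hospital, na.rm = TRUE)
```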
serious_logistic_regression <-
  glm(
    likelihood_hospital_cut ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid,
    family = binomial(link = "logit")
  )

summary(serious_logistic_regression)
##
## Call:
## glm(formula = likelihood_hospital_cut ~ age_cat + sex + political_orientation +
## marstat, family = binomial(link = "logit"), data = gp_covid)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.259 -1.008 -0.756 1.242 2.070
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1525329 0.3987424 -5.398 6.73e-08 ***
## age_cat2 0.5570465 0.4001281 1.392 0.16387
## age_cat3 0.2237200 0.4070216 0.550 0.58256
## age_cat4 0.8552502 0.3851614 2.220 0.02638 *
## age_cat5 1.0188140 0.3837725 2.655 0.00794 **
## age_cat6 1.2015271 0.3783196 3.176 0.00149 **
## age_cat7 1.5248309 0.3643355 4.185 2.85e-05 ***
## age_cat8 1.7367353 0.3739040 4.645 3.40e-06 ***
## age_cat9 1.9148385 0.3750669 5.105 3.30e-07 ***
## age_cat10 1.9574845 0.3732600 5.244 1.57e-07 ***
## sex 0.0687836 0.0777806 0.884 0.37652
## political_orientation 0.0247512 0.0206458 1.199 0.23059
## marstat -0.0007526 0.0481534 -0.016 0.98753
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4067.1 on 3104 degrees of freedom
## Residual deviance: 3888.4 on 3092 degrees of freedom
## (660 observations deleted due to missingness)
## AIC: 3914.4
##
## Number of Fisher Scoring iterations: 4
# Interpretation:
# With increasing age, the estimated probability of believing one is likely
# to be hospitalized because of COVID-19 increases. But does it really?
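Logit coefficients are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios, which are often easier to communicate — a sketch using Wald confidence intervals via `confint.default()`:

```r
# Odds ratios with 95% Wald confidence intervals for the logit model;
# values above 1 indicate higher odds of an above-median likelihood rating
exp(
  cbind(
    OR = coef(serious_logistic_regression),
    confint.default(serious_logistic_regression)
  )
)
```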
Running one model is not enough. The assumptions of logistic regression may not be fulfilled, or a reviewer may simply dislike this type of regression. Suppose she instead proposes running a binomial regression with a cauchit link. See `?family` to learn how to include a cauchit link.
serious_cauchit_regression <-
  glm(
    likelihood_hospital_cut ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid,
    family = binomial(link = "cauchit")
  )

summary(serious_cauchit_regression)
##
## Call:
## glm(formula = likelihood_hospital_cut ~ age_cat + sex + political_orientation +
## marstat, family = binomial(link = "cauchit"), data = gp_covid)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2523 -1.0064 -0.7601 1.2431 2.0485
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.548363 0.849671 -2.999 0.00271 **
## age_cat2 1.036283 0.875953 1.183 0.23680
## age_cat3 0.502607 0.912892 0.551 0.58193
## age_cat4 1.427898 0.852017 1.676 0.09376 .
## age_cat5 1.605221 0.848333 1.892 0.05846 .
## age_cat6 1.792792 0.843912 2.124 0.03364 *
## age_cat7 2.072772 0.837850 2.474 0.01336 *
## age_cat8 2.246052 0.840343 2.673 0.00752 **
## age_cat9 2.386031 0.840525 2.839 0.00453 **
## age_cat10 2.419927 0.840032 2.881 0.00397 **
## sex 0.046430 0.069014 0.673 0.50109
## political_orientation 0.017770 0.018113 0.981 0.32655
## marstat -0.004667 0.040372 -0.116 0.90796
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4067.1 on 3104 degrees of freedom
## Residual deviance: 3889.1 on 3092 degrees of freedom
## (660 observations deleted due to missingness)
## AIC: 3915.1
##
## Number of Fisher Scoring iterations: 5
# Interpretation of difference:
# Hard to say from the coefficients alone: logit and cauchit estimates live
# on different scales, so their sizes are not directly comparable. The
# pattern across the age categories, however, looks very similar.
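One way to build intuition for the difference: the cauchit link replaces the logistic CDF with the Cauchy CDF, which has much heavier tails, so extreme linear predictors are mapped to probabilities less close to 0 and 1. A small self-contained comparison:

```r
# Inverse link functions evaluated at the same linear predictor values:
# plogis() is the logistic CDF (logit link), pcauchy() the Cauchy CDF
# (cauchit link). The Cauchy curve approaches 0 and 1 more slowly.
eta <- c(-4, -2, 0, 2, 4)
round(plogis(eta), 3)
round(pcauchy(eta), 3)
```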
Ok, different link functions are sometimes used because they can yield different model fits. That’s definitely something we should investigate as well. We can compare the two models with `anova()`, using `test = "LRT"` to perform a likelihood ratio test. What’s your interpretation?
anova(
  serious_logistic_regression,
  serious_cauchit_regression,
  test = "LRT"
)
## Analysis of Deviance Table
##
## Model 1: likelihood_hospital_cut ~ age_cat + sex + political_orientation +
## marstat
## Model 2: likelihood_hospital_cut ~ age_cat + sex + political_orientation +
## marstat
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 3092 3888.4
## 2 3092 3889.1 0 -0.72287
# Interpretation of difference:
# The residual deviances (3888.4 vs. 3889.1) are nearly identical, and with
# 0 degrees of freedom the test cannot produce a p-value: the models are not
# nested, they just use different links. We can make the reviewer happy with
# the cauchit model and still keep our results.
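Since the models are not nested (identical predictors, different links), an information criterion comparison is a more defensible way to choose between them — a sketch, assuming both fitted model objects from above:

```r
# AIC comparison of the two non-nested models; with identical parameter
# counts, AIC ranks them purely by deviance, so expect a tiny difference
AIC(serious_logistic_regression, serious_cauchit_regression)
```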